Using Comparable Corpora to Adapt a Translation Model to Domains

نویسندگان

  • Hiroyuki Kaji
  • Takashi Tsunakawa
  • Daisuke Okada
چکیده

Statistical machine translation (SMT) requires a large parallel corpus, which is available only for restricted language pairs and domains. To expand the language pairs and domains to which SMT is applicable, we created a method for estimating translation pseudo-probabilities from bilingual comparable corpora. The essence of our method is to calculate pairwise correlations between the words associated with a source-language word, presently restricted to a noun, and its translations; word translation pseudo-probabilities are calculated based on the assumption that the more associated words a translation is correlated with, the higher its translation probability. We also describe a method we created for calculating noun-sequence translation pseudo-probabilities based on occurrence frequencies of noun sequences and constituent-word translation pseudo-probabilities. Then, we present a framework for merging the translation pseudo-probabilities estimated from in-domain comparable corpora with a translation model learned from an out-of-domain parallel corpus. Experiments using Japanese and English comparable corpora of scientific paper abstracts and a Japanese-English parallel corpus of patent abstracts showed promising results; the BLEU score was improved to some degree by incorporating the pseudo-probabilities estimated from the in-domain comparable corpora. Future work includes an optimization of the parameters and an extension to estimate translation pseudo-probabilities for verbs.

برای دانلود متن کامل این مقاله و بیش از 32 میلیون مقاله دیگر ابتدا ثبت نام کنید

ثبت نام

اگر عضو سایت هستید لطفا وارد حساب کاربری خود شوید

منابع مشابه

استخراج پیکره‌ موازی از اسناد قابل‌مقایسه برای بهبود کیفیت ترجمه در سیستم‌های ترجمه ماشینی

Data used for training statistical machine translation method are usually prepared from three resources: parallel, non-parallel and comparable text corpora. Parallel corpora are an ideal resource for translation but due to lack of these kinds of texts, non-parallel and comparable corpora are used either for parallel text extraction. Most of existing methods for exploiting comparable corpora loo...

متن کامل

Collecting and Using Comparable Corpora for Statistical Machine Translation

Lack of sufficient parallel data for many languages and domains is currently one of the major obstacles to further advancement of automated translation. The ACCURAT project is addressing this issue by researching methods how to improve machine translation systems by using comparable corpora. In this paper we present tools and techniques developed in the ACCURAT project that allow additional dat...

متن کامل

Creation of comparable corpora for English-Urdu, Arabic, Persian

Statistical Machine Translation (SMT) relies on the availability of rich parallel corpora. However, in the case of under-resourced languages or some specific domains, parallel corpora are not readily available. This leads to under-performing machine translation systems in those sparse data settings. To overcome the low availability of parallel resources the machine translation community has rec...

متن کامل

Language and Translation Model Adaptation using Comparable Corpora

Traditionally, statistical machine translation systems have relied on parallel bi-lingual data to train a translation model. While bi-lingual parallel data are expensive to generate, monolingual data are relatively common. Yet monolingual data have been under-utilized, having been used primarily for training a language model in the target language. This paper describes a novel method for utiliz...

متن کامل

Harvesting Comparable Corpora and Mining Them for Equivalent Bilingual Sentences Using Statistical Classification and Analogy-Based Heuristics

Parallel sentences are a relatively scarce but extremely useful resource for many applications including cross-lingual retrieval and statistical machine translation. This research explores our new methodologies for mining such data from previously obtained comparable corpora. The task is highly practical since non-parallel multilingual data exist in far greater quantities than parallel corpora,...

متن کامل

ذخیره در منابع من


  با ذخیره ی این منبع در منابع من، دسترسی به آن را برای استفاده های بعدی آسان تر کنید

برای دانلود متن کامل این مقاله و بیش از 32 میلیون مقاله دیگر ابتدا ثبت نام کنید

ثبت نام

اگر عضو سایت هستید لطفا وارد حساب کاربری خود شوید

عنوان ژورنال:

دوره   شماره 

صفحات  -

تاریخ انتشار 2010